# [Benchmark]: Add model_config sweep mode and model registry #1180

noemotiovon wants to merge 2 commits into
Conversation
- Add Qwen 2.5 models (7B / 14B / 72B) and DeepSeek models (V2 Lite / V3) to `MODEL_REGISTRY`
- Add model_config sweep support to all 33 benchmark scripts, enabling benchmarks to sweep across different model architectures at a fixed sequence length
- Refactor benchmark scripts by extracting helper functions (`_setup_*`, `_resolve_model_config_*`) to improve code reuse and keep implementations cleaner across sweep modes
- Add grouped bar chart visualization in `benchmarks_visualizer` for model_config sweep results
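For context, registry entries pair a model name with its architecture config. A minimal sketch of what such a registry can look like (the `ModelConfig` fields and the exact shape of the repo's registry are assumptions, not the PR's actual definitions; the Qwen 2.5 7B and Llama 3 8B numbers come from the models' public configs):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    # Illustrative fields only; the real ModelConfig in the repo may differ.
    name: str
    hidden_size: int
    intermediate_size: int
    num_attention_heads: int
    vocab_size: int

# Hypothetical registry entries keyed by the CLI model name.
MODEL_REGISTRY = {
    "qwen2_5_7b": ModelConfig("qwen2_5_7b", 3584, 18944, 28, 152064),
    "llama_3_8b": ModelConfig("llama_3_8b", 4096, 14336, 32, 128256),
}
```

Keeping configs in a flat name-keyed dict lets every benchmark script sweep `MODEL_REGISTRY.values()` without model-specific branching.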
## Benchmark Framework Design

This document describes the overall design of the Liger-Kernel benchmark suite, including its two benchmark dimensions, the shared infrastructure, and the phased implementation plan.

### 1. Benchmark Dimensions

Every operator should ideally be benchmarked along two orthogonal dimensions:

**D1: Non-model dimension sweep (implemented).** Sweep non-model dimensions (e.g. sequence length, BT) with a fixed model config selected via `--model`.

**D2: Model dimension sweep (implemented).** Sweep model architecture dimensions (e.g. hidden_size, or discrete model configs from `MODEL_REGISTRY`) at a fixed token length.

### 2. D2 Design Choices

Following the maintainer discussion, we evaluated three approaches:

**Decision:** C as the primary approach, with A as optional enrichment for ops where single-parameter scaling is important.

**Rationale:**

### 3. Universal Token Length for D2

For D2 benchmarks, we need a fixed token length that is safe (no OOM) across all model configs and all operators.

**Strategy**
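The search can be sketched without any GPU code. Below is a toy, torch-free version of the strategy (function and variable names are hypothetical, and the probe here is a stand-in returning simulated peak bytes; a real probe would run the kernel and read `torch.cuda.max_memory_allocated()`):

```python
def pick_safe_bt(model_names, probe_fn, candidate_bts=(8192, 4096, 2048, 1024),
                 budget_bytes=24e9, memory_utilization=0.4):
    """Return the largest candidate BT whose worst-case peak fits the budget.

    probe_fn(name, bt) must return the measured peak memory in bytes for one
    model config at the given token count.
    """
    usable = budget_bytes * memory_utilization
    for bt in candidate_bts:  # try the largest BT first
        worst = max(probe_fn(name, bt) for name in model_names)
        if worst <= usable:
            return bt
    raise RuntimeError("no candidate BT fits the memory budget")

# Toy probe: pretend peak memory scales linearly with BT and hidden size
# (bf16 bytes times an arbitrary activation fudge factor).
sizes = {"llama_3_8b": 4096, "qwen2_5_72b": 8192}
peak = lambda name, bt: bt * sizes[name] * 2 * 200

print(pick_safe_bt(sizes, peak))
```

The key property is that the chosen BT is conservative across *all* registered configs, so one D2 sweep run cannot OOM on the largest model.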
**Proposed CLI**

```bash
# D1 (existing): token-length sweep with fixed model
python benchmark_geglu.py --model llama_3_8b

# D2 (new): model-config sweep with fixed token length
python benchmark_geglu.py --sweep-mode model_config --bt 2048
```

### 4. Infrastructure Changes

#### 4.1 New config type

```python
@dataclass(frozen=True)
class ModelConfigSweepConfig:
    """Config for D2 benchmarks that sweep across model configs."""

    model_configs: List[ModelConfig]  # models to benchmark
    bt: int  # fixed batch * seq_len
    batch_size: int  # safe batch size
    seq_len: int  # safe seq_len
```

#### 4.2 New helper

```python
def compute_model_config_sweep_config(
    model_configs: List[ModelConfig],
    probe_fn_factory: Callable[[ModelConfig, int], Callable[[], torch.Tensor]],
    bt: int = 2048,
    memory_utilization: float = 0.4,
) -> ModelConfigSweepConfig:
    """Find safe (batch_size, seq_len) that works across all model configs.

    For each model config, runs probe_fn_factory(model_config, bt) to measure
    peak memory, then picks the most conservative batch_size / seq_len.
    """
    ...
```

#### 4.3 Script-level changes

Each benchmark script gains a model-config sweep code path gated by `args.sweep_mode`:

```python
if args.sweep_mode == "model_config":
    configs = [MODEL_REGISTRY[name] for name in MODEL_REGISTRY]
    sweep = compute_model_config_sweep_config(configs, probe_fn_factory=..., bt=args.bt)
    # x_values = model config indices
    # extra_benchmark_configs = contains all model configs
    ...
else:
    # existing token-length sweep logic
    ...
```

#### 4.4 Visualization

D2 results produce grouped bar charts (speedup or throughput) rather than line charts:
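Such a chart can be produced with matplotlib along these lines (a sketch with made-up speedup numbers, not the actual `benchmarks_visualizer` code; one bar group per model config, one bar per kernel provider):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical D2 results: speedup per provider for each model config.
models = ["llama_3_8b", "qwen2_5_7b", "deepseek_v2_lite"]
speedup = {"liger": [1.8, 1.6, 1.7], "huggingface": [1.0, 1.0, 1.0]}

x = np.arange(len(models))      # one group per model config
width = 0.8 / len(speedup)      # split the group width across providers
fig, ax = plt.subplots()
for i, (provider, vals) in enumerate(speedup.items()):
    ax.bar(x + i * width, vals, width, label=provider)
ax.set_xticks(x + width / 2)
ax.set_xticklabels(models, rotation=20)
ax.set_ylabel("speedup vs baseline")
ax.legend()
fig.savefig("model_config_sweep.png")
```

Bar charts fit D2 because the x-axis is categorical (discrete model configs), so there is no meaningful line to interpolate between points.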
### 5. Phased Implementation Plan

**Phase 1: Foundation (current PR)**
Status: complete

**Phase 2: Model-config sweep (D2)**
Status: complete

**Phase 3: Rollout and visualization**
Status: in progress
### Phase 3 Kernel Rollout Tracking

Already refactored (D1 + D2):

- Norm-like kernels (input: BT × hidden_size):
- Loss kernels (input involves vocab_size or similar):
- RLHF/alignment loss kernels:
- Positional encoding kernels:
- Activation / misc kernels:
- Attention kernels:
- Other:
### 6. Directory Structure
```python
        This module re-computes forward in the backward, so forward occurs twice per iteration.
        """
```

> maybe we could keep these comments
```python
def __init__(self, mhc_cls, *, hidden_size, hc, num_heads, intermediate_mult, tmax, dtype, device):
    ...

def __init__(
    self, mhc_cls, *, vocab_size, hidden_size, hc, num_layers, num_heads, intermediate_mult, tmax, dtype, device
):
    ...

def _build_model(provider, *, hidden_size, hc, num_layers, num_heads, intermediate_mult, vocab_size, tmax, dtype):
    ...
```
```python
    Uses the DeepSpeed TiledMLP algorithm for memory-efficient MLP computation.
    """

def __init__(self, config, num_shards=None):

        }
    ],
    "overwrite": args.overwrite,
}
```

> We have built a general class BenchMiniMHCLM to test in this benchmark
```python
    groups=groups,
    bias=bias,
    dtype=dtype,
    device=device,
)
```

```python
        "bias": True,
        "dtype": torch.bfloat16,
    },
],
```

> we have dropped too many extra configs here

```python
extra_benchmark_configs=[
    {"M": 2048, "dtype": torch.float32},
    {"M": 2048, "dtype": torch.bfloat16},
],
```
```python
if args.sweep_mode == "model_config":
    all_model_configs = list(MODEL_REGISTRY.values())
    T = 512
    BT = 2048
```

> BT is too small compared to the current one
| {"B": 32, "T": 512, "D": 768, "dtype": torch.float32}, | ||
| # Llama | ||
| {"B": 8, "T": 2048, "D": 4096, "dtype": torch.float32}, | ||
| ], |
There was a problem hiding this comment.
here we already have a bert-like model and a llama-like model
| else: | ||
| model = get_benchmark_model_config(args.model) | ||
| T = 512 | ||
| probe_bt = 2048 |
```python
dq, dk = torch.randn_like(q, device=device, dtype=dtype), torch.randn_like(k, device=device)

dq, dk = torch.randn_like(q, device=device, dtype=dtype), torch.randn_like(k, device=device, dtype=dtype)

ms_50, ms_20, ms_80 = triton.testing.do_bench(fwd_fn, grad_to_none=[q, k], rep=400, quantiles=QUANTILES)

"x_name": "T",
"x_label": "sequence length",
"x_values": [2**i for i in range(10, int(math.log2(max(1024, config.seq_len))) + 1)],
"kernel_providers": ["liger", "huggingface"],

q = torch.randn((1, seq_len, num_q_heads, head_dim), device=device, requires_grad=True, dtype=dtype)
k = torch.randn((1, seq_len, num_kv_heads, head_dim), device=device, requires_grad=True, dtype=dtype)

def __init__(self, H, V, dtype, use_bias=False, use_ref_bias=False, ignore_index=-100, beta=0.1):
    ...

self.KTO_loss = HFKTOLoss(ignore_index=ignore_index, beta=beta, use_ref_model=True).get_batch_loss_metrics

ms_50, ms_20, ms_80 = triton.testing.do_bench(fwd, rep=100, quantiles=QUANTILES)

ms_50, ms_20, ms_80 = triton.testing.do_bench(full, rep=100, quantiles=QUANTILES)
```
---

## …uadratic scaling support (linkedin#1218)

### Summary

Refs linkedin#1200. Addresses non-linear memory scaling in benchmark sweep config inference.

The existing `compute_seq_len_sweep_config` inverts memory via `max_tokens = usable_bytes / kernel_bytes_per_token`, which only holds for linear-scaling kernels. For O(L²) kernels (e.g. `benchmark_sparse_multi_token_attention.py`), this overestimates capacity by orders of magnitude — the existing workaround there divides by `probe_L * probe_L`, but the downstream sweep math still treats the result as linear bytes-per-token.

Per discussion on the issue (linkedin#1200 (comment)), this PR adds a new helper rather than threading `scaling_method` through the existing function — 16+ benchmark scripts call `estimate_kernel_peak_memory` today, and a wider signature change would conflict with in-flight benchmark refactors (linkedin#1199, linkedin#1180). Linear-scaling callers are unchanged; only quadratic-scaling benchmarks opt in.

### What changed

- **`benchmark/scripts/benchmark_model_configs.py`** — adds `compute_seq_len_sweep_config_with_probe(model_cfg, probe_fn, probe_seq_len, probe_batch_size=1, scaling_method="linear" | "quadratic", ...)`. Internalizes the probe call + inversion; reuses `estimate_kernel_peak_memory` for the measurement.
- **`benchmark/scripts/benchmark_sparse_multi_token_attention.py`** — switches the `token_length` sweep mode to the new helper with `scaling_method="quadratic"`, dropping the manual `peak_bytes // (probe_L * probe_L)` workaround.

`estimate_kernel_peak_memory` and `compute_seq_len_sweep_config` are untouched.

### Validation

Hardware: A10G 24GB (g5.xlarge).

Synthetic O(L²) probe (B=2, L=2048, allocates `B * L * L` floats) using `LLAMA_3_8B` config and `max_seq_len=2**20` to bypass the model cap so the raw inversion is visible:

```
quadratic: SeqLenSweepConfig(batch_size=2, seq_len=8192)
linear:    SeqLenSweepConfig(batch_size=2, seq_len=65536)
```

The 8× gap (≈17× before snap-to-power-of-2) demonstrates the inversion difference: `linear` claims a sweep at L=65536 fits, when in reality L² at that size would require multiple TBs. `quadratic` lands at a realistic L=8192. This matches the issue's premise — for non-linear-scaling kernels, the existing inversion overestimates capacity and would OOM at the predicted boundary.

### Testing Done

- [x] Synthetic O(L²) sanity check on A10G — confirms `quadratic` predicts L=8192 vs `linear` predicts L=65536 for the same probe (8× separation, scales as expected).
- [x] `benchmark_sparse_multi_token_attention.py` imports + helper resolution verified locally.
- [ ] Full sparse-attention end-to-end sweep on A10G (deferred — synthetic test already isolates the inversion math from kernel-specific noise).

cc @Tcc0403
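The linear-vs-quadratic inversion gap reduces to simple arithmetic. The following toy reconstruction uses hypothetical helper names and an assumed 24 GB card with 40% utilization, not the PR's exact code or budget; under these numbers the pre-snap gap works out to ≈17×, the same ratio reported above:

```python
import math

def invert_linear(usable_bytes, peak_bytes, probe_tokens):
    """Linear model: peak = c * L, so max L = usable / (peak / probe_L)."""
    bytes_per_token = peak_bytes / probe_tokens
    return int(usable_bytes / bytes_per_token)

def invert_quadratic(usable_bytes, peak_bytes, probe_tokens):
    """Quadratic model: peak = c * L^2, so max L = probe_L * sqrt(usable / peak)."""
    return int(probe_tokens * math.sqrt(usable_bytes / peak_bytes))

def snap_pow2(n):
    """Round down to the nearest power of two, as sweep configs do."""
    return 2 ** int(math.log2(n))

# Toy numbers mirroring the synthetic probe: B=2, L=2048, fp32 B*L*L floats.
probe_L = 2048
peak = 2 * probe_L * probe_L * 4   # probe's measured peak bytes
usable = 24e9 * 0.4                # assumed 24 GB budget at 40% utilization

print(snap_pow2(invert_linear(usable, peak, probe_L)))     # wildly optimistic
print(snap_pow2(invert_quadratic(usable, peak, probe_L)))  # realistic
```

Note the general relationship: the linear inversion overestimates the quadratic one by exactly `sqrt(usable / peak)`, so the gap grows with headroom — the more memory appears free at probe time, the further past the true OOM boundary the linear estimate lands.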
**Hardware Type:** Atlas 800I A2

- `make test` to ensure correctness
- `make checkstyle` to ensure code style
- `make test-convergence` to ensure convergence